##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6

Introduction

I am going to examine the data set and the variables and attributes it contains.

This report explores a dataset of 4,898 white wines with 13 columns, 11 of which quantify the chemical properties of each wine.

Univariate Plots

## [1] 4898
## [1] 13

There are 4898 rows and 13 columns. I have also removed the serial number given by ‘X’, which is not meaningful for our analysis.
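A minimal sketch of this step, assuming the data frame is called `ww` (the name that appears in the model calls later in this report). Only the first six rows of three columns are re-entered from the listing above, so the dimensions printed here are smaller than those of the full data set.

```r
# First six rows of three columns, copied from the listing at the top.
ww <- data.frame(
  X             = 1:6,
  fixed.acidity = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1),
  quality       = c(6L, 6L, 6L, 6L, 6L, 6L)
)

# Drop the serial-number column; it carries no chemical information.
ww$X <- NULL

dim(ww)    # number of rows, then number of columns
names(ww)  # remaining column names
```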

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

There are only numeric and integer values in the data set. We might have to change some of the variable types in our analysis (specifically ‘quality’).
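One possible way to change the type of ‘quality’ is to turn it into an ordered factor. The sketch below uses hypothetical scores, and the 3 to 9 level range is taken from the minimum and maximum reported for quality in this data set; the variable name `quality_f` is an assumption, not a name used in the original analysis.

```r
# Hypothetical quality scores, for illustration only.
quality <- c(6L, 5L, 7L, 3L, 9L, 6L)

# An ordered factor keeps the ranking (3 < 4 < ... < 9) while letting
# plotting functions treat each score as a discrete group.
quality_f <- factor(quality, levels = 3:9, ordered = TRUE)

table(quality_f)             # counts per score, including empty levels
quality_f[1] > quality_f[2]  # comparisons respect the ordering
```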

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

This gives us the range, quartiles, median and mean of every variable.

I am going to evaluate each variable in the following section to examine its distribution.

Fixed acidity is approximately normally distributed, which gives a fair understanding of the data. The distribution is unimodal, peaking around 6.8. Some outliers below a fixed acidity of 4 and above 10 have been removed from the plot. According to Waterhouse, most wines have a tartaric acid value between 1 g/dm^3 and 4 g/dm^3. Is there a strong correlation between fixed acidity and pH? Now let’s explore what the plots look like for the other variables.

Volatile acidity is also unimodal, peaking around 0.28. Waterhouse claims that the average acetic acid value is less than 400 mg/L, which is in line with our dataset (mean 0.278 g/dm^3). The legal limit of acetic acid in the US for white wine is 1.1 g/dm^3. Too much acetic acid can result in unpleasant aromas. In addition, both acetic acid and acetaldehyde are toxic to Saccharomyces cerevisiae and may lead to stuck fermentations.

Citric acid is also approximately normally distributed, peaking around 0.3. Why is there a sudden peak at around 0.49?

According to Waterhouse, one would expect to see 0 to 500 mg/L of citric acid. This might be why there is a peak at around 0.49-0.5.

Residual sugar has a long-tailed distribution. There are some extreme outliers in the 30s and 70s, which have been removed from the graph. According to winefolly.com:
- < 1 g/L (g/dm^3): Bone Dry
- 1 to 10 g/L: Dry
- 10 to 35 g/L: Off-Dry
- 35 to 120 g/L: Sweet Wine
- 120 to 220 g/L: Very Sweet Wine

We can conclude that most of the wines in the data set are Dry wines.

A wine is dry when the yeast consumes all the available sugar and produces ethanol as a by-product. This is why some sweet wines have less alcohol than their dry counterparts. We can look at the correlation between residual sugar content and alcohol. Is this an inverse relationship?

The transformed distribution is bimodal and peaks at two places. First around 4 and then around 9. What do these peaks represent?

All chloride values lie between 0 and 1, and most fall below 0.1. The distribution is roughly normal with a peak at around 0.04 (the median is 0.043). Most wines have a salt content of less than 0.1.

Free Sulfur Dioxide seems like a normal distribution with its peak at approximately 30. Most wines have a Sulphur Dioxide content of less than 100.

Total sulfur dioxide is roughly normally distributed with a peak in the 120s. Sulfites are used to preserve wines. Most people tolerate sulfites well, but some have extreme allergic reactions to them. According to Waterhouse, the average sulfite content in wine is around 80 mg/L, which is of the same order of magnitude as the values in the dataset (mean 138.4). SO2 content above 50 is detectable in the nose and taste of wine. Given this, there are many wines in the dataset where SO2 might become evident in the nose and taste.

Density seems to follow a normal distribution with peak at nearly 0.992. There are a few outliers as well.

pH seems to follow a normal distribution with a peak at nearly 3.15. According to Dr. Vinny’s post on winespectator.com, the ideal pH value for white wines is around 3.0-3.4.

Sulphates are approximately normally distributed with a peak at 0.5. Potassium sulphate is an additive that can contribute to sulfur dioxide gas, which acts as an antimicrobial and antioxidant.

White wines have a distribution between 8.5% and 14%, with concentration between 9% and 10.5%.

Most of the wines are given a quality score of 6. These scores may be biased in many ways, since they are sensory data and entirely subjective. The scores might differ if a different set of experts were used.

Let’s look at all variable values by quality:

The fixed acidity (tartaric acid) for wines of different quality peaks between 6 and 8 g/L.

This does not give any particular insight as such. Volatile acidity value of all quality types peaks around 0.2.

Citric acid graph also does not provide us any particular insights. There’s a peak around 0.5 which was examined earlier.

This plot does not give any particular insight.

There’s a general peak between 20 and 40. This does not give us any key takeaways.

This gives us no particular insights.

This gives us no particular insights.

This gives us no particular insights.

This is the only plot which offers us some insights in this section. As realised throughout our analysis, alcohol has a meaningful correlation with quality.

I am going to create a new variable called Dryness based on the literature available online.

Most of the wines in our dataset belong to the dry category.
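A sketch of how such a variable could be built with `cut()`, using the residual sugar values of the first six wines listed at the top and the winefolly.com categories quoted earlier. How values exactly on a boundary are assigned (here each bin includes its lower edge) is an assumption, as is the variable name `dryness`.

```r
# Residual sugar for the first six wines listed at the top of the report.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9)

# Bin edges follow the winefolly.com categories quoted earlier.
# right = FALSE makes each bin include its lower edge: [1, 10), [10, 35), ...
dryness <- cut(
  residual.sugar,
  breaks = c(0, 1, 10, 35, 120, 220),
  labels = c("Bone Dry", "Dry", "Off-Dry", "Sweet", "Very Sweet"),
  right  = FALSE
)

table(dryness)  # number of wines per category
```

With these six rows, the first wine (20.7) falls in the Off-Dry bin and the other five are Dry.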

Univariate Analysis

What is the structure of your dataset?

The data set consists of 4,898 variants of the Portuguese White Wine “Vinho Verde”, with measurements of eleven chemical properties:

- Fixed Acidity: acid that contributes to the conservation of wine.
- Volatile Acidity: amount of acetic acid in wine, which at high levels can lead to an unpleasant vinegar taste.
- Citric Acid: found in small amounts, can add “freshness” and flavor to wines.
- Residual Sugar: amount of sugar remaining after the end of the fermentation.
- Chlorides: amount of salt in wine.
- Free Sulfur Dioxide: prevents the growth of microbes and the oxidation of the wine.
- Total Sulfur Dioxide: at high levels it becomes evident in the aroma and taste of the wine.
- Density: density of the wine, which depends on the percentage of alcohol and the amount of sugar.
- pH: describes how acidic or basic a wine is on a scale of 0 to 14.
- Sulphates: additive that acts as an antimicrobial and antioxidant.
- Alcohol: percentage of alcohol present in the wine.

And a sensorial property:
- Quality: grade between 0 and 10 given by specialists.

Observations:
- Most wines have medium quality (5 and 6).
- There’s no evident predictor of quality from the univariate analysis.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the data set is quality, which is also our dependent variable. I’d like to determine which features are best for predicting the quality of wine. I suspect some combination of the chemical property variables can be used to build a predictive model of the quality of white wines.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

It is very difficult to predict quality from the given variables at first glance. I did not notice any significant relationship even after facet wrapping the variables by quality. Perhaps I could take the relationships between residual sugar and the other properties as a starting point for further investigation.

Did you create any new variables from existing variables in the dataset?

I created a new variable called dryness, which is based on the residual sugar content as follows:
- < 1 g/L (g/dm^3): Bone Dry
- 1 to 10 g/L: Dry
- 10 to 35 g/L: Off-Dry
- 35 to 120 g/L: Sweet Wine
- 120 to 220 g/L: Very Sweet Wine

Most of the wines are Dry in nature.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

It was necessary to remove anomalies and extreme values in some cases for better visualisations. Some properties, like residual sugar and density, had extreme values. In addition, the residual sugar of the white wines had a long-tailed distribution. I used a log10 transformation and got a bimodal distribution.
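To illustrate why a log10 transformation helps with a long right tail, the sketch below applies it to the five-number summary of residual sugar reported earlier (minimum, quartiles, maximum). It is only meant to show the compression effect, not to reproduce the histogram.

```r
# Residual sugar values from the summary output: min, quartiles, max.
residual.sugar <- c(0.6, 1.7, 5.2, 9.9, 65.8)

# On the raw scale the maximum sits far from the rest of the values.
diff(range(residual.sugar))

# log10 compresses the long right tail so the bulk of the data spreads out.
log_sugar <- log10(residual.sugar)
round(log_sugar, 2)
diff(range(log_sugar))
```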

Bivariate Plots Section

I am going to start with a clean pairs.panels plot to examine some key relationships, plots and correlations between variables.

Now I am going to draw scatter plots to analyse the relationship between our feature of focus (quality) and the chemical properties. I will also jitter these plots to give a better perspective.
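The effect of jittering can be sketched with base R's `jitter()`; this is a stand-in for whatever plotting layer produced the original figures, and the `amount` of 0.3 is an arbitrary choice for illustration. The values are the six wines listed at the top of the report.

```r
set.seed(42)  # make the random jitter reproducible

quality <- c(6, 6, 6, 6, 6, 6)
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1)

# Every point shares quality 6, so an unjittered plot stacks them on one line.
# jitter() adds uniform noise of at most `amount` in either direction.
quality_j <- jitter(quality, amount = 0.3)

range(quality_j - quality)   # displacement stays within +/- 0.3
# plot(alcohol, quality_j)   # uncomment to draw the jittered scatter plot
```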

An initial decreasing trend followed by an increasing trend was observed. For wines of quality below 6, the relationship with alcohol is negative.

After removing outliers, this looks like a negative relationship. According to the literature available online, I was able to confirm this relationship.

After removing outliers, this looks like a negative relationship. According to the literature available online, I was able to confirm this relationship.

No linear relationship was observed. There are no takeaways from this plot.

No linear relationship was observed. There are no takeaways from this plot.

A decreasing trend was observed. This is also logically sound, since fermentation of sugar results in more alcohol. From our analysis, it is fair to state that the higher the alcohol content, the better the quality of the wine.

No linear relationship was observed. There are no takeaways from this plot.

No linear relationship was observed. There are no takeaways from this plot.

A decreasing trend was observed.

No linear relationship was observed. There are no takeaways from this plot.

No linear relationship was observed. There are no takeaways from this plot.

A positive trend was observed.

According to Waterhouse, the total acidity is the sum of fixed and volatile acidity.
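As a small worked example of that sum, the sketch below adds the two columns for the first six wines listed at the top. The name `total.acidity` is hypothetical, and treating both columns as being in the same unit is an assumption.

```r
# Fixed and volatile acidity for the first six wines listed at the top.
fixed.acidity    <- c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1)
volatile.acidity <- c(0.27, 0.30, 0.28, 0.23, 0.23, 0.28)

# Total acidity as the element-wise sum of the two components.
total.acidity <- fixed.acidity + volatile.acidity
total.acidity
```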

It is clear from above that alcohol has the strongest correlation with quality. Here are the noteworthy correlations involving quality. I had to utilize the integer version of the quality variable in order to calculate the correlations.

- Quality and alcohol: 0.436
- Quality and density: -0.307

However, both these correlations can’t be considered strong.
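If quality has been stored as a factor, it has to be converted back to numbers before `cor()` can be used. The sketch below shows one way to do that on hypothetical values; none of these numbers are rows from the data set, and the variable names are assumptions.

```r
# Hypothetical values for illustration only (not rows from the data set).
quality_f <- factor(c(5, 6, 6, 7, 8), levels = 3:9, ordered = TRUE)
alcohol   <- c(9.2, 10.1, 9.8, 11.5, 12.3)

# as.integer() on a factor returns level positions (1, 2, ...), not scores.
# Going through as.character() recovers the original scores.
quality_int <- as.integer(as.character(quality_f))
quality_int

cor(quality_int, alcohol)  # Pearson correlation by default
```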

Let’s take a look at boxplots involving quality.

Only alcohol and density have a meaningful relationship with quality score. I have arranged both these plots below.

I am going to analyse quality by taking into consideration central tendencies of density to see this relationship better.

## # A tibble: 7 x 6
##   quality mean_density median_density min_density max_density     n
##     <int>        <dbl>          <dbl>       <dbl>       <dbl> <int>
## 1       3    0.9948840       0.994425     0.99110     1.00010    20
## 2       4    0.9942767       0.994100     0.98920     1.00040   163
## 3       5    0.9952626       0.995300     0.98722     1.00241  1457
## 4       6    0.9939613       0.993660     0.98758     1.03898  2198
## 5       7    0.9924524       0.991760     0.98711     1.00040   880
## 6       8    0.9922359       0.991640     0.98713     1.00060   175
## 7       9    0.9914600       0.990300     0.98965     0.99700     5

The medians again show that density tends to fall as quality rises: median density decreases at every step except from quality 4 to 5.
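The grouped summary can be sanity-checked against the rest of the report. The sketch below re-enters the table values by hand (the data frame name `by_quality` is hypothetical) and checks that the group sizes add up to the row count and that the count-weighted mean of the group means agrees with the overall mean density reported earlier.

```r
# Values re-entered from the grouped summary table.
by_quality <- data.frame(
  quality        = 3:9,
  mean_density   = c(0.9948840, 0.9942767, 0.9952626, 0.9939613,
                     0.9924524, 0.9922359, 0.9914600),
  median_density = c(0.994425, 0.994100, 0.995300, 0.993660,
                     0.991760, 0.991640, 0.990300),
  n              = c(20, 163, 1457, 2198, 880, 175, 5)
)

# Group sizes should add up to the number of rows in the data set.
sum(by_quality$n)

# The count-weighted mean of group means recovers the overall mean density.
weighted.mean(by_quality$mean_density, by_quality$n)

# Change in median density from one quality score to the next.
diff(by_quality$median_density)
```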

In addition to evaluating the correlations related to quality, I also want to probe how other variables work with each other. Here are the correlations of note that do not involve quality:

- Total sulfur dioxide and residual sugar: 0.401
- Total sulfur dioxide and free sulfur dioxide: 0.616
- Total sulfur dioxide and alcohol: -0.449
- Density and residual sugar: 0.839
- Alcohol and density: -0.780
- Residual sugar and alcohol: -0.451
- Fixed acidity and pH: -0.426

Density, alcohol, and residual sugar all appear to be strongly correlated to each other, so I am going to take a closer look at those plots.

The correlations are very evident in the charts shown above. Sugar must be denser than the other ingredients in the wine, because higher density levels imply a higher sugar quantity. Similarly, more alcohol seems to imply lower density. Lastly, alcohol and sugar may offset each other during the wine-making process, because wines with lower levels of alcohol tend to have higher levels of sugar (and vice versa).

I also want to make a special note about pH and acidity. All three acidity variables are correlated with pH; the clearest case listed above is fixed acidity at -0.426. This is logical, as a higher pH value corresponds to lower acidity.
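The direction of these three relationships can be seen even in the six wines listed at the top of the report. The sketch below computes their correlation matrix; with only six rows (two of them duplicated) the magnitudes are not meaningful, so only the signs should be read from it.

```r
# Values for the first six wines listed at the top of the report.
alcohol        <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1)
density        <- c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951)
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9)

# Pairwise Pearson correlations between the three variables.
cors <- cor(cbind(alcohol, density, residual.sugar))
round(cors, 2)

sign(cors["alcohol", "density"])         # alcohol vs density
sign(cors["density", "residual.sugar"])  # density vs residual sugar
sign(cors["alcohol", "residual.sugar"])  # alcohol vs residual sugar
```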

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I evaluated all the variables against our main feature, quality, and observed that alcohol content has the strongest relationship with it. However, the correlation is still only moderate (0.436). Another variable that may slightly influence quality is density.

Initially, as alcohol content increases, quality decreases. Subsequently, as alcohol content increases, quality increases. The relationship is not linear, as the smoothing line shows.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I discovered strong correlations between alcohol, residual sugar and density. As alcohol content increases, density tends to decrease rather linearly. Furthermore, as residual sugar increases, density also increases. A linear model fits this well. Finally, as the residual sugar level rises, the alcohol level decreases. This was clarified by the literature available online. I mainly referred to literature provided by Waterhouse.

What was the strongest relationship you found?

The strongest correlation was seen between Density and Residual Sugar.

Multivariate Plots Section

You can see that the graph generally gets darker to the right. The correlations between alcohol and quality and between density and quality are evident.

Given the same quality, wine without a sulfur aroma is more likely to have a higher alcohol level. For instance, among wines with a quality score of 6 and no sulfur smell, the median alcohol by volume is 10.6%, compared with 9.6% among wines with the same quality score and an evident sulfur smell (represented by the blue boxplots). Therefore, you are more likely to get better quality wine if the sulfur level is unnoticeable.

I am going to try to construct a linear model to predict the quality score based on the chemical properties.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = ww)
## m2: lm(formula = quality ~ alcohol + density, data = ww)
## m3: lm(formula = quality ~ alcohol + density + chlorides, data = ww)
## m4: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity, 
##     data = ww)
## m5: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity + 
##     volatile.acidity, data = ww)
## m6: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity + 
##     volatile.acidity + pH, data = ww)
## m7: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide, data = ww)
## m8: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide + log(residual.sugar), 
##     data = ww)
## m9: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide + log(residual.sugar) + 
##     citric.acid, data = ww)
## m10: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide + log(residual.sugar) + 
##     citric.acid + free.sulfur.dioxide, data = ww)
## m11: lm(formula = quality ~ alcohol + density + chlorides + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide + log(residual.sugar) + 
##     citric.acid + free.sulfur.dioxide + sulphates, data = ww)
## 
## ==================================================================================================================================================================================
##                              m1            m2            m3            m4            m5            m6            m7            m8            m9           m10           m11       
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)               2.582***    -22.492***    -21.150***    -31.387***    -47.652***    -47.870***    -43.543***     41.731***     42.639***     37.700***     47.757***  
##                            (0.098)       (6.165)       (6.162)       (6.355)       (6.195)       (6.222)       (6.510)      (11.223)      (11.284)      (11.294)      (11.437)    
##   alcohol                   0.313***      0.360***      0.343***      0.356***      0.405***      0.406***      0.408***      0.310***      0.308***      0.310***      0.296***  
##                            (0.009)       (0.015)       (0.015)       (0.015)       (0.015)       (0.015)       (0.015)       (0.018)       (0.019)       (0.019)       (0.019)    
##   density                                24.728***     23.671***     34.437***     50.909***     51.237***     46.805***    -39.975***    -40.902***    -36.049**     -46.226***  
##                                          (6.079)       (6.074)       (6.293)       (6.137)       (6.199)       (6.501)      (11.351)      (11.414)      (11.422)      (11.567)    
##   chlorides                                            -2.382***     -2.421***     -1.323*       -1.334*       -1.399**      -0.762        -0.808        -0.831        -0.818     
##                                                        (0.558)       (0.555)       (0.539)       (0.540)       (0.541)       (0.541)       (0.544)       (0.542)       (0.541)    
##   fixed.acidity                                                      -0.087***     -0.101***     -0.103***     -0.103***     -0.027        -0.029        -0.020        -0.014     
##                                                                      (0.014)       (0.014)       (0.015)       (0.015)       (0.017)       (0.017)       (0.017)       (0.017)    
##   volatile.acidity                                                                 -2.085***     -2.088***     -2.112***     -2.117***     -2.101***     -1.981***     -1.953***  
##                                                                                    (0.110)       (0.111)       (0.111)       (0.110)       (0.112)       (0.114)       (0.114)    
##   pH                                                                                             -0.031        -0.042         0.326***      0.332***      0.343***      0.317***  
##                                                                                                  (0.081)       (0.081)       (0.090)       (0.090)       (0.090)       (0.090)    
##   total.sulfur.dioxide                                                                                          0.001*        0.000         0.000        -0.001        -0.001*    
##                                                                                                                (0.000)       (0.000)       (0.000)       (0.000)       (0.000)    
##   log(residual.sugar)                                                                                                         0.225***      0.226***      0.210***      0.232***  
##                                                                                                                              (0.024)       (0.024)       (0.024)       (0.025)    
##   citric.acid                                                                                                                               0.075         0.057         0.037     
##                                                                                                                                            (0.097)       (0.096)       (0.096)    
##   free.sulfur.dioxide                                                                                                                                     0.004***      0.004***  
##                                                                                                                                                          (0.001)       (0.001)    
##   sulphates                                                                                                                                                             0.502***  
##                                                                                                                                                                        (0.099)    
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                 0.190         0.192         0.195         0.202         0.256         0.256         0.257         0.270         0.270         0.274         0.278     
##   adj. R-squared            0.190         0.192         0.195         0.201         0.255         0.255         0.256         0.269         0.269         0.272         0.276     
##   sigma                     0.797         0.796         0.795         0.792         0.764         0.764         0.764         0.757         0.757         0.755         0.754     
##   F                      1146.395       583.290       396.315       309.222       336.912       280.734       241.554       225.827       200.787       184.336       170.797     
##   p                         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -5839.391     -5831.127     -5822.011     -5802.684     -5629.932     -5629.861     -5627.322     -5584.491     -5584.187     -5570.814     -5557.825     
##   Deviance               3112.257      3101.773      3090.247      3065.956      2857.136      2857.053      2854.093      2804.611      2804.262      2788.992      2774.238     
##   AIC                   11684.782     11670.255     11654.021     11617.368     11273.865     11275.722     11272.645     11188.982     11190.373     11165.629     11141.649     
##   BIC                   11704.272     11696.241     11686.504     11656.348     11319.341     11327.694     11331.114     11253.948     11261.836     11243.588     11226.105     
##   N                      4898          4898          4898          4898          4898          4898          4898          4898          4898          4898          4898         
## ==================================================================================================================================================================================

No combination of variables could give a good model to predict the quality score. The R-squared value is very low even after including all variables (0.278 for m11). This is not a strong fit.
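The fit statistics in the table can also be compared programmatically. The sketch below re-enters the adjusted R-squared, AIC and BIC rows by hand (the data frame name `fits` is hypothetical) and looks at which model each criterion prefers and which added predictors made AIC worse.

```r
# Fit statistics re-entered from the model comparison table.
fits <- data.frame(
  model  = paste0("m", 1:11),
  adj_r2 = c(0.190, 0.192, 0.195, 0.201, 0.255, 0.255,
             0.256, 0.269, 0.269, 0.272, 0.276),
  aic    = c(11684.782, 11670.255, 11654.021, 11617.368, 11273.865,
             11275.722, 11272.645, 11188.982, 11190.373, 11165.629,
             11141.649),
  bic    = c(11704.272, 11696.241, 11686.504, 11656.348, 11319.341,
             11327.694, 11331.114, 11253.948, 11261.836, 11243.588,
             11226.105),
  stringsAsFactors = FALSE
)

# Lower AIC/BIC is better; both penalise extra parameters.
fits$model[which.min(fits$aic)]
fits$model[which.min(fits$bic)]

# Steps where adding a predictor made AIC worse rather than better.
fits$model[which(diff(fits$aic) > 0) + 1]

# Gain in adjusted R-squared from each added predictor.
round(diff(fits$adj_r2), 3)
```

Both criteria favour m11, and the two steps that raise AIC (m6 and m9) are the ones that add pH and citric.acid, whose coefficients are not significant at those steps in the table. The largest single gain in adjusted R-squared comes at m5, when volatile.acidity enters.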

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In this section, I tried to visualise some of the variables more concisely and precisely. Some of the insights into relationships between alcohol, density and residual sugars were strengthened.

Were there any interesting or surprising interactions between features?

It is interesting to note that the chemical property trends of wines of quality 5 and below are almost the inverse of those of wines of quality 6 and above. This might be due to the influence of an unknown variable that is not in the dataset. Alternatively, there might be something that I have missed. The use of artificial flavouring and other chemical agents might give low quality wines the same chemical properties but different tastes.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I tried to fit a linear model into the dataset to predict the quality of white wine based on the features provided in the data set.

The model grew stronger as I added more features. However, a linear model may not be the best way to represent this data. R-squared values were too low and residuals were high. Using all the features provided (R-squared 0.278) is not very different from using only alcohol as a predictor (R-squared 0.190), which was tried in the bivariate section. This might be because some of the features are correlated with each other.

To improve the model we might need to introduce new features or a new way to transform the data. Moreover, there might be a better method than a linear model to predict quality.
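The point about correlated features can be illustrated on simulated data. In the sketch below, none of the values are wine measurements: density is simulated as a near-linear function of alcohol, and quality depends on alcohol only, with coefficients loosely modelled on m1. Adding the collinear predictor cannot lower R-squared, but it has little new information to contribute.

```r
set.seed(1)  # reproducible simulated data; these are NOT the wine measurements

n <- 500
alcohol <- runif(n, 8, 14)
# Density is simulated as a near-linear function of alcohol (strongly collinear).
density <- 1.02 - 0.0025 * alcohol + rnorm(n, sd = 0.001)
# Quality depends on alcohol only, plus noise.
quality <- 2.5 + 0.3 * alcohol + rnorm(n, sd = 0.8)
sim <- data.frame(quality, alcohol, density)

m1 <- lm(quality ~ alcohol, data = sim)
m2 <- update(m1, . ~ . + density)  # same model plus one more predictor

# R-squared of the two nested models.
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
r2

# Correlation between the two predictors.
cor(sim$alcohol, sim$density)
```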

Final Plots and Summary

Final Plot 1

The strongest correlation observed between the feature of interest and any other feature was with alcohol, at 0.436. This relationship can be visualised using the above chart. We can see that the concentration of points increases from left to right, meaning that as the alcohol level increases, quality also increases. Taking a closer look at the box plot, we realise that the increasing trend is not steady: between quality 3 and 5 the relationship is negative. It may also be reasonable to assume that beyond about 12.5% alcohol the quality of wine will decrease, because the alcohol taste will overpower the native wine taste.

Final Plot 2

This is a good visualisation of the relationship between alcohol, density and quality. I have removed the outliers to make the visualisation better.

Alcohol and density have a negative relationship: as alcohol content increases, density decreases. Also, the better quality wines are concentrated at the top left of the graph. The graph disperses in the middle and converges at the bottom right. This also hints that as density increases, the quality of wine tends to decrease.

Reflection

The White Wines dataset contains information on 4898 samples of Portuguese white wine (Vinho Verde) across 11 chemical properties and a special feature called quality score, which was evaluated by wine experts. I started by exploring individual variables in the dataset and went on to investigate the relationship between each chemical property and quality, which was chosen as the main feature in my analysis. Eventually, I tried to create a linear model to predict the quality of wine given the other chemical properties.

There was a trend between quality and alcohol, but the other variables did not produce a strong correlation with quality. However, the variables were more or less strongly correlated with each other. This might also be the reason why I was not able to come up with a linear model that predicts the quality score straight away. Transformations might have worked, but I could not identify a direction to go forward with. Alternatively, the absence of other features in the data set might also be a reason why I wasn’t able to produce a good linear model in my analysis.

Some limitations of this data include missing features like glycerol, tannin, amino acids, minerals, etc. Another limitation is that the quality score is a very subjective indicator. A more robust database could have produced a better model.

Having said that, this is my first project in R. I have so much to learn and I am sure that as the course progresses I will be able to deliver better.